[Kernels][MoE] Fix legacy_routing to use bitmatrix-based routing path #38504

tjtanaa merged 10 commits into vllm-project:main
Conversation
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Code Review
This pull request updates the MoE routing logic to handle bitmatrix-based routing and introduces a mechanism in the scheduler to defer block freeing during asynchronous KV transfers. Specifically, it adds a _deferred_block_req_ids set to track these requests and ensures the engine continues stepping while transfers are pending. I have no feedback to provide.
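For context, here is a minimal sketch of the deferral idea the review describes; everything beyond `finished_req_ids` and `_deferred_block_req_ids` is a hypothetical simplification, and the real vLLM scheduler is considerably more involved:

```python
# Hypothetical, simplified sketch -- not the actual vLLM scheduler.
class Scheduler:
    def __init__(self) -> None:
        self.finished_req_ids: set[str] = set()
        # Requests whose KV blocks cannot be freed yet because an
        # asynchronous KV transfer is still in flight.
        self._deferred_block_req_ids: set[str] = set()

    def finish_request(self, req_id: str, kv_transfer_pending: bool) -> None:
        if kv_transfer_pending:
            self._deferred_block_req_ids.add(req_id)  # defer the block free
        else:
            self.finished_req_ids.add(req_id)

    def on_kv_transfer_done(self, req_id: str) -> None:
        # Transfer complete: the request's blocks may be freed next step.
        self._deferred_block_req_ids.discard(req_id)
        self.finished_req_ids.add(req_id)

    def has_finished_requests(self) -> bool:
        # Deferred requests keep the engine stepping until their frees land.
        return bool(self.finished_req_ids) or bool(self._deferred_block_req_ids)
```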
```diff
 def has_finished_requests(self) -> bool:
-    return len(self.finished_req_ids) > 0
+    return len(self.finished_req_ids) > 0 or len(self._deferred_block_req_ids) > 0
```
Yep, you're right, I reverted that.
…outing Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
```python
if sm_first:
    logits = torch.softmax(logits, dim=-1)
sparse_logits = topk(logits, n_expts_act, apply_softmax=not sm_first)
return legacy_routing_from_sparsematrix(
```
After you remove this, `legacy_routing_from_sparsematrix` is no longer used anywhere. Let's remove it.
Moreover, please disclose the accuracy of the models.
I am not familiar with this part of the code.
I removed that legacy routing logic. I also enabled the GPQA eval tests for GPT-OSS accuracy.
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Does it affect performance?

@tjtanaa Actually I was about to cite the tests regarding accuracy, but I think I've got to rebase first to get clean logs: https://buildkite.com/vllm/amd-ci/builds/7267/steps/canvas?sid=019d553e-778b-452a-bde4-8882ae4cdcf2&tab=output
…0 eval configs Signed-off-by: Andreas Karatzas <akaratza@amd.com>
GPQA accuracy
| Config | main | PR branch |
|---|---|---|
| mxfp4-bf16-aiter | PASSED (0.5739) | PASSED (0.5600) |
| mxfp4-bf16-triton | PASSED (0.5701) | PASSED (0.5726) |
| mxfp4-fp8-triton | PASSED (0.5562) | PASSED (0.5593) |
| baseline | PASSED (0.5707) | PASSED (0.5530) |
Serving benchmark (amd/gpt-oss-20b-w-mxfp4-a-bf16, TP=2, gfx950, 512 prompts, rate=10)
1024 in / 1024 out
| Metric | main triton | main aiter | PR triton | PR aiter |
|---|---|---|---|---|
| Successful / Failed requests | 512 / 0 | 512 / 0 | 512 / 0 | 512 / 0 |
| Output token throughput (tok/s) | 392.62 | 540.17 | 467.22 | 527.19 |
| Mean TTFT (ms) | 69.90 | 38.70 | 80.01 | 37.85 |
| Mean TPOT (ms) | 109.03 | 54.68 | 123.46 | 54.23 |
| Request throughput (req/s) | 6.86 | 9.46 | 8.15 | 9.53 |
8192 in / 8192 out
| Metric | main triton | main aiter | PR triton | PR aiter |
|---|---|---|---|---|
| Successful / Failed requests | 512 / 0 | 512 / 0 | 512 / 0 | 512 / 0 |
| Output token throughput (tok/s) | 309.76 | 337.94 | 252.00 | 472.75 |
| Mean TTFT (ms) | 628.07 | 347.65 | 699.16 | 343.38 |
| Mean TPOT (ms) | 329.76 | 162.47 | 389.91 | 164.78 |
| Request throughput (req/s) | 4.05 | 4.63 | 4.04 | 7.26 |
…vllm-project#38504) Signed-off-by: Andreas Karatzas <akaratza@amd.com> Signed-off-by: Jacob Lou <jacoblou0924@gmail.com>
…vllm-project#38504) Signed-off-by: Andreas Karatzas <akaratza@amd.com> Signed-off-by: Rishi Puri <riship@nvidia.com>
…vllm-project#38504) Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Three issues prevented `test_gpt_oss_triton_kernels` and GPQA serving from working on gfx950 (CDNA4) after the legacy routing deprecation:

1. **`pack_bitmatrix` sets bit 31 spuriously on HIP:** Triton on HIP uses C-style integer division: `-1 // 32 == 0` (not `-1` as in Python). The masked-out lanes loaded with `other=-1` produce `div=0`, and `1 << (-1 % 32)` is undefined behavior that sets bit 31 on AMD. This causes expert 31 to appear in every bitmatrix row, corrupting the `SparseMatrix` routing metadata and dropping valid token-expert pairs. On the serving path this results in GPU memory access faults that crash the server. Fixed by adding a `valid = indices >= 0` guard. On CUDA this is a no-op since Python-style floor division gives `div=-1`, which never matches any column offset.
2. **Test padding alignment too small for CDNA4:** `CDNA4MXScaleLayout.swizzle_data` reshapes with `SCALE_K // 8`, requiring `K % 256 == 0`. The test used `round_up(K, 64)`, which is only sufficient on Hopper (where `HopperMXScaleLayout` pads internally). Made the alignment platform-conditional: ROCm uses 256/512, matching the production values in `mxfp4_round_up_hidden_size_and_intermediate_size`; CUDA keeps the original 64/128. (A small numeric illustration follows this list.)
3. **gfx950 GPQA eval configs missing tokenizer and TP:** The `amd/gpt-oss-20b-w-mxfp4-a-bf16` model repo declares `tokenizer_class=TokenizersBackend`, which is not a valid HuggingFace tokenizer, causing server launch failure. Added `--tokenizer openai/gpt-oss-20b`. Also added `--tensor-parallel-size 2` to all ROCm configs to prevent GPU memory faults with `enforce-eager`.
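To make the alignment arithmetic in issue 2 concrete, a small illustration; `round_up` and the choice of `K` here are assumptions for the example, not the test's actual values:

```python
def round_up(x: int, multiple: int) -> int:
    """Round x up to the nearest multiple (illustrative helper)."""
    return ((x + multiple - 1) // multiple) * multiple

K = 2880  # example hidden size, already a multiple of 64
assert round_up(K, 64) == 2880 and 2880 % 256 != 0   # fine on Hopper, breaks CDNA4
assert round_up(K, 256) == 3072 and 3072 % 256 == 0  # satisfies CDNA4's reshape
```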
### Root cause detail

`pack_bitmatrix` loads `topk_ids` with `other=-1` for out-of-bounds lanes. On HIP/ROCm, C-style division maps the `-1` sentinel onto word 0 with an out-of-range shift amount (sketched below). This silently inserts expert 31 into the bitmatrix for every token, which corrupts `col_sum` (expert 31 gets count `n_rows` instead of 0), causing `col_sorted_indx` to have `-1` entries that drop valid token-expert pairs from the computation. During serving, these corrupted indices lead to out-of-bounds memory accesses that crash the GPU.
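Below is a minimal plain-Python sketch of the failing arithmetic and the guard; the `pack_row` helper is hypothetical and only emulates what the Triton kernel does per row (on AMD hardware the out-of-range shift wraps, which is what effectively sets bit 31):

```python
idx = -1  # sentinel loaded for masked-out lanes (other=-1)

# Python / CUDA Triton semantics: floor division, non-negative modulo.
assert idx // 32 == -1   # word index -1 never matches any column offset
assert idx % 32 == 31

# HIP Triton semantics: C-style division truncates toward zero.
div_c = int(idx / 32)     # 0  -> the sentinel lane targets word 0
rem_c = idx - div_c * 32  # -1 -> `1 << -1` is undefined behavior; on AMD
                          #       the shift wraps and sets bit 31

# The fix: mask invalid lanes before packing so sentinels contribute nothing.
def pack_row(indices: list[int], n_words: int = 2) -> list[int]:
    words = [0] * n_words
    for i in indices:
        if i >= 0:  # the added `valid = indices >= 0` guard
            words[i // 32] |= 1 << (i % 32)
    return words

assert pack_row([3, 33, -1, -1]) == [1 << 3, 1 << 1]
```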
### Test plan

- `test_gpt_oss_triton_kernels.py`: 5/5 passed (was 1 passed, 4 failed)

### [LEGACY] Initial Test plan and Motivation
`legacy_routing` used `legacy_routing_from_sparsematrix`, which extracted `col_sorted_indx` directly from the `topk()` `SparseMatrix`, producing indices that differ from the bitmatrix-reconstructed path (missing `-1` padding for empty expert slots). In this PR we align `legacy_routing` with the same code path used by `triton_kernel_moe_forward` and `make_routing_data`: unpack the `topk()` results and route through `legacy_routing_from_bitmatrix` (tuple/ROCm) or `make_routing_data` (`SparseMatrix`/NVIDIA), as sketched below.
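A rough sketch of the aligned path; the signatures of `topk`, `legacy_routing_from_bitmatrix`, and `make_routing_data` are simplified assumptions (the real helpers in the triton-kernels MoE integration take more arguments):

```python
import torch
# topk, legacy_routing_from_bitmatrix, and make_routing_data are assumed to
# be importable from the triton-kernels MoE integration (simplified here).

def legacy_routing(logits: torch.Tensor, n_expts_act: int, sm_first: bool = False):
    if sm_first:
        logits = torch.softmax(logits, dim=-1)
    routed = topk(logits, n_expts_act, apply_softmax=not sm_first)
    if isinstance(routed, tuple):
        # ROCm build: topk() returns unpacked tensors; rebuild the routing
        # metadata through the bitmatrix-based path.
        return legacy_routing_from_bitmatrix(*routed)
    # NVIDIA build: topk() returns a SparseMatrix; reuse make_routing_data so
    # legacy_routing matches the triton_kernel_moe_forward code path.
    return make_routing_data(routed)
```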
### Test plan

`pytest tests/kernels/moe/test_gpt_oss_triton_kernels.py::test_legacy_routing -v`

Motivation: https://buildkite.com/vllm/amd-ci/builds/7062/steps/canvas?jid=019d382d-cbac-41c4-a53f-f4f56244488e&tab=output
cc @kenroche